Abstract. We consider Hidden Markov Chains obtained by passing a Markov
Chain with rare transitions through a noisy memoryless channel. We obtain
asymptotic estimates for the entropy of the resulting Hidden Markov Chain as
the transition rate is reduced to zero.



Let $(X_n)$ be a Markov chain with finite state space $S$ and transition matrix
$P(p)$ and let $(Y_n)$ be the Hidden Markov chain observed by passing $(X_n)$ through
a homogeneous noisy memoryless channel (i.e. $Y$ takes values in a set $T$, and there
exists a matrix $Q$ such that
$\mathbb{P}(Y_n = j \mid X_n = i, X_{-\infty}^{n-1}, X_{n+1}^{\infty}, Y_{-\infty}^{n-1}, Y_{n+1}^{\infty}) = Q_{ij}$).
We make the additional assumption on the channel that the rows of $Q$ are distinct.
In this case we call the channel statistically distinguishing.

We assume that $P(p)$ is of the form $I + pA$ where $A$ is a matrix with negative
entries on the diagonal, non-negative off-diagonal entries and zero
row sums. We further assume that for small positive $p$, the Markov chain with
transition matrix $P(p)$ is irreducible. Notice that for Markov chains of this form,
the invariant distribution $(\pi_i)_{i \in S}$ does not depend on $p$ (since $\pi P(p) = \pi$
is equivalent to $\pi A = 0$). In this case, we say that
for small positive values of $p$, the Markov chain is in a rare transition regime.

We will adopt the convention that $H$ is used to denote the entropy of a finite
partition, whereas $h$ is used to denote the entropy of a process (the entropy
rate in information theory terminology). Given an irreducible Markov chain
with transition matrix $P$, we let $h(P)$ be the entropy of the Markov chain (i.e.
$h(P) = -\sum_{i,j} \pi_i P_{ij} \log P_{ij}$ where $\pi_i$ is the (unique) invariant distribution of the
Markov chain and as usual we adopt the convention that $0 \log 0 = 0$). We also
let $H_{\mathrm{chan}}(i)$ be the entropy of the output of the channel when the input symbol
is $i$ (i.e. $H_{\mathrm{chan}}(i) = -\sum_{j \in T} Q_{ij} \log Q_{ij}$). Let $h(Y)$ denote the entropy of $Y$ (i.e.
$h(Y) = -\lim_{N \to \infty} \frac{1}{N} \sum_{w \in T^N} \mathbb{P}(Y_0^{N-1} = w) \log \mathbb{P}(Y_0^{N-1} = w)$).
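The quantities just defined are easy to evaluate numerically. The following is a minimal sketch, not part of the original argument: the matrices `A` and `Q` are hypothetical examples, and the invariant distribution is approximated by power iteration (which assumes the chain is aperiodic as well as irreducible). It evaluates $h(P(p)) + \sum_i \pi_i H_{\mathrm{chan}}(i)$ for two values of $p$ and shows that the computed invariant distribution is the same for both.

```python
import math

def rare_transition_matrix(A, p):
    """Return P(p) = I + p*A as a list of rows."""
    n = len(A)
    return [[(1.0 if i == j else 0.0) + p * A[i][j] for j in range(n)]
            for i in range(n)]

def stationary(P, iters=5000):
    """Approximate the invariant distribution by power iteration
    (assumption: the chain is irreducible and aperiodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def xlogx(x):
    """x*log(x) with the convention 0*log(0) = 0 (natural logarithm)."""
    return 0.0 if x == 0 else x * math.log(x)

def markov_entropy(P, pi):
    """h(P) = -sum_{i,j} pi_i P_ij log P_ij."""
    n = len(P)
    return -sum(pi[i] * xlogx(P[i][j]) for i in range(n) for j in range(n))

def channel_entropy(Q, i):
    """H_chan(i) = -sum_j Q_ij log Q_ij."""
    return -sum(xlogx(q) for q in Q[i])

# Hypothetical example: A has zero row sums, Q has distinct rows.
A = [[-1.0, 0.5, 0.5],
     [1.0, -2.0, 1.0],
     [0.5, 0.5, -1.0]]
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6],
     [0.3, 0.4, 0.3]]

for p in (0.1, 0.01):
    P = rare_transition_matrix(A, p)
    pi = stationary(P)
    total = markov_entropy(P, pi) + sum(pi[i] * channel_entropy(Q, i)
                                        for i in range(len(Q)))
    print(p, [round(x, 6) for x in pi], round(total, 6))
```

For very small $p$ power iteration on $P(p)$ converges slowly; since the invariant distribution does not depend on $p$, it can be computed once at a moderate value of $p$ and reused.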

Theorem 1. Consider the Hidden Markov Chain $(Y_n)$ obtained by observing a
Markov chain with irreducible transition matrix $P(p) = I + pA$ through a statistically
distinguishing channel with transition matrix $Q$. Then there exists a constant $C > 0$
such that for all small $p > 0$,

(1) $h(P(p)) + \sum_i \pi_i H_{\mathrm{chan}}(i) - Cp \le h(Y) \le h(P(p)) + \sum_i \pi_i H_{\mathrm{chan}}(i),$

where $(\pi_i)_{i \in S}$ is the invariant distribution of $P(p)$.
If in addition the channel has the property that there exist $i$, $i'$ and $j$ such that
$P_{ii'} > 0$, $Q_{ij} > 0$ and $Q_{i'j} > 0$, then there exists a constant $c > 0$ such that

(2) $h(Y) \le h(P(p)) + \sum_i \pi_i H_{\mathrm{chan}}(i) - cp.$

The entropy rate in the rare transition regime was considered previously in the
special case of a 0-1 valued Markov Chain with transition matrix
$P(p) = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$
and where the channel was the binary symmetric channel with crossover probability
$\epsilon$ (i.e. $Q = \begin{pmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{pmatrix}$).
It is convenient to introduce the notation $g(p) = -p \log p - (1-p) \log(1-p)$.
In [4], Nair, Ordentlich and Weissman proved that
$g(\epsilon) - (1-2\epsilon)^2 p \log p/(1-\epsilon) \le h(Y) \le g(p) + g(\epsilon)$.
For comparison with our result, this is
essentially of the form $g(\epsilon) + a(\epsilon) g(p) \le h(Y) \le g(p) + g(\epsilon)$ where $a(\epsilon) < 1$ but
$a(\epsilon) \to 1$ as $\epsilon \to 0$ (i.e. $h(Y) = g(p) + g(\epsilon) - O(p \log p)$). A second paper due to
Chigansky [1] shows that $g(\epsilon) + b(\epsilon) g(p) \le h(Y)$ for a function $b(\epsilon) < 1$ satisfying
$b(\epsilon) \to 1$ as $\epsilon \to 1/2$ (again giving an $O(p \log p)$ error). Our result states in this
case that there exist $C > c > 0$ such that $g(p) + g(\epsilon) - Cp \le h(Y) \le g(p) + g(\epsilon) - cp$
(i.e. $h(Y) = g(p) + g(\epsilon) - \Theta(p)$).
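For the binary symmetric case, $h(Y)$ can be bracketed numerically and compared with $g(p) + g(\epsilon)$. The sketch below is an illustration only and does not use the method of this paper: it relies on the standard sandwich $H(Y_n \mid Y_1^{n-1}, X_1) \le h(Y) \le H(Y_n \mid Y_1^{n-1})$ for stationary hidden Markov processes (see for example Cover and Thomas, Elements of Information Theory), evaluated by brute-force enumeration of all output words of length $n$. The values `p = 0.01`, `eps = 0.1` and `n = 10` are arbitrary choices.

```python
import math
from itertools import product

def g(p):
    """Binary entropy in nats."""
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def block_entropy(P, Q, init, n):
    """H(Y_1, ..., Y_n) in nats when X_1 has distribution init (n >= 1)."""
    states, symbols = range(len(P)), range(len(Q[0]))
    total = 0.0
    for y in product(symbols, repeat=n):
        # Forward recursion: alpha[x] = P(Y_1..Y_t = y_1..y_t, X_t = x).
        alpha = [init[x] * Q[x][y[0]] for x in states]
        for t in range(1, n):
            alpha = [sum(alpha[x] * P[x][z] for x in states) * Q[z][y[t]]
                     for z in states]
        prob = sum(alpha)
        if prob > 0:
            total -= prob * math.log(prob)
    return total

def sandwich(P, Q, pi, n):
    """Return (H(Y_n | Y_1^{n-1}, X_1), H(Y_n | Y_1^{n-1})) for n >= 2."""
    upper = block_entropy(P, Q, pi, n) - block_entropy(P, Q, pi, n - 1)
    lower = 0.0
    for x, weight in enumerate(pi):
        delta = [1.0 if s == x else 0.0 for s in range(len(pi))]
        lower += weight * (block_entropy(P, Q, delta, n)
                           - block_entropy(P, Q, delta, n - 1))
    return lower, upper

p, eps = 0.01, 0.1
P = [[1 - p, p], [p, 1 - p]]
Q = [[1 - eps, eps], [eps, 1 - eps]]
lower, upper = sandwich(P, Q, [0.5, 0.5], n=10)
print(lower, upper, g(p) + g(eps))
```

The enumeration costs $|T|^n$ forward passes, so this is only practical for short blocks; the two bounds tighten as $n$ grows.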

We note that as part of the proof we attempt a reconstruction of $(X_n)$ from
the observed data $(Y_n)$. In our case, the reconstruction of the $n$th symbol $X_n$
depends on both past and future values of $(Y_m)$. A related but harder problem of filtering
is to try to reconstruct $X_n$ given only $Y_1^n$. This problem was addressed in essentially
the same scenario by Khasminskii and Zeitouni [3], where they gave a lower bound
for the asymptotic reconstruction error of the form $Cp|\log p|$ for an explicit constant
$C$ (i.e. for an arbitrary reconstruction scheme, the probability of wrongly guessing
$X_n$ is bounded below in the limit as $n \to \infty$ by $Cp|\log p|$). Our scheme shows
that if one is allowed to use future as well as past observations then the asymptotic
reconstruction error is $O(p)$. This was previously observed by Shue, Anderson and
DeBruyne in [5], who used a scheme similar to ours.

Before giving the proof of the theorem, we discuss the strategy. We start from 
the equality 

(3) $h(X) + h(Y \mid X) = h(X, Y) = h(Y) + h(X \mid Y).$

Since $h(X)$ and $h(Y \mid X)$ are known to be $h(P(p))$ and $\sum_i \pi_i H_{\mathrm{chan}}(i)$, the estimates
for the entropy of $Y$ are obtained by estimating $h(X \mid Y)$. The inequality (1) is
equivalent to showing that $0 \le h(X \mid Y) \le Cp$ for some $C > 0$. The lower bound here
is trivial, whereas the main part of the proof is the upper bound for $h(X \mid Y)$ (giving a
lower bound for $h(Y)$). The second part of the theorem, inequality (2), which lowers the upper
bound for $h(Y)$ under additional conditions, is proved by showing $h(X \mid Y) \ge cp$ for
some $c > 0$.

We explain briefly the underlying idea of the upper bound $h(X \mid Y) = O(p)$.
Since the transitions in the $(X_n)$ sequence are rare, given a realization of $(Y_n)$,
the $Y_n$ values allow one to guess (using the statistical-distinguishing property) the
$X_n$ values from which the $Y_n$ values are obtained. This provides for an accurate
reconstruction except that where there is a transition in the $X_n$'s there is some
uncertainty as to its location as estimated using the $Y_n$'s. It turns out that by
using maximum likelihood estimation, the transition locations may be pinpointed
up to an error with exponentially small tail. Since the transitions occur with rate
$p$, there is an $O(p)$ entropy error in reconstructing $(X_n)$ from $(Y_n)$.
We make use of a number of notational conventions, some standard and others
less so. Firstly we shall denote events by set notation so that $\{X_0 = X_2\}$
denotes the event that the random variables $X_0$ and $X_2$ agree. We make extensive use
of relative entropy. For two partitions $\mathcal{P}$ and $\mathcal{Q}$, the relative entropy is defined by
$H(\mathcal{Q} \mid \mathcal{P}) = H(\mathcal{P} \vee \mathcal{Q}) - H(\mathcal{P})$. When conditioning, we shall not distinguish between
random variables and the partitions and $\sigma$-algebras that they induce (so that for
example $H(X_0^{N-1})$ is $-\sum_{w \in S^N} \mathbb{P}(X_0^{N-1} = w) \log \mathbb{P}(X_0^{N-1} = w)$ and $H(X_0 \mid Y)$ is
the conditional entropy of $X_0$ relative to the $\sigma$-algebra generated by $\{Y_n : n \in \mathbb{Z}\}$).
On the other hand if $A$ is an event, we use $H(\mathcal{P} \mid A)$ to mean the entropy of the
partition with respect to the conditional measure $\mathbb{P}_A(B) = \mathbb{P}(A \cap B)/\mathbb{P}(A)$. For jointly
stationary processes $(X_n)_{n \in \mathbb{Z}}$ and $(Y_n)_{n \in \mathbb{Z}}$, the relative entropy of the processes
is given by

$h(Y \mid X) = h((X_n, Y_n)_{n \in \mathbb{Z}}) - h((X_n)_{n \in \mathbb{Z}}) = \lim_{N \to \infty} \frac{1}{N} \left( H(X_0^{N-1} \vee Y_0^{N-1}) - H(X_0^{N-1}) \right) = \lim_{N \to \infty} \frac{1}{N} H(Y_0^{N-1} \mid X_{-\infty}^{\infty}) = H(Y_0 \mid X_{-\infty}^{\infty}, Y_{-\infty}^{-1}).$

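Given a measurable partition $\mathcal{Q}$ of the space, an event $A$ and a $\sigma$-algebra $\mathcal{F}$ we
will write $H(\mathcal{Q} \mid \mathcal{F} \mid A)$ for the entropy of $\mathcal{Q}$ relative to $\mathcal{F}$ with respect to $\mathbb{P}_A$. In the
case where $A$ is $\mathcal{F}$-measurable (as will always be the case in what follows), we have

$H(\mathcal{Q} \mid \mathcal{F} \mid A) = \int \left( -\sum_{B \in \mathcal{Q}} \mathbb{P}(B \mid \mathcal{F}) \log \mathbb{P}(B \mid \mathcal{F}) \right) d\mathbb{P}_A.$

If $A_1, \ldots, A_k$ form an $\mathcal{F}$-measurable partition of the space, then we have the
following equality:

(4) $H(\mathcal{Q} \mid \mathcal{F}) = \sum_{j=1}^{k} \mathbb{P}(A_j) H(\mathcal{Q} \mid \mathcal{F} \mid A_j).$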

Proof of Theorem 1. Note that $((X_n, Y_n))_{n \in \mathbb{Z}}$ forms a Markov chain with transition
matrix $\bar{P}$ given by $\bar{P}_{(i,j),(i',j')} = P_{ii'} Q_{i'j'}$ and invariant distribution $\bar{\pi}_{(i,j)} = \pi_i Q_{ij}$.
The standard formula for the entropy of a Markov chain then gives
$h(X, Y) = h(P(p)) + \sum_i \pi_i H_{\mathrm{chan}}(i)$. Since $h(X, Y) = h(Y) + h(X \mid Y)$, one obtains

(5) $h(Y) = h(X, Y) - h(X \mid Y) = h(P(p)) + \sum_i \pi_i H_{\mathrm{chan}}(i) - h(X \mid Y).$

Since $h(X \mid Y) \ge 0$, this establishes the upper bound in the first part of the theorem.
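The identity $h(X, Y) = h(P(p)) + \sum_i \pi_i H_{\mathrm{chan}}(i)$ can be checked numerically by building the joint chain on $S \times T$ explicitly. The following is a small sketch under assumed example values: a two-state chain with hypothetical rates `a`, `b` (for which the invariant distribution is known in closed form) and a hypothetical three-symbol channel `Q`.

```python
import math

def xlogx(x):
    """x*log(x) with the convention 0*log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy_rate(P, pi):
    """Entropy of a Markov chain: -sum_{r,c} pi_r P_rc log P_rc."""
    n = len(P)
    return -sum(pi[r] * xlogx(P[r][c]) for r in range(n) for c in range(n))

# Hypothetical two-state rare-transition chain P(p) = I + p*A
# with A = [[-a, a], [b, -b]]; its invariant distribution is (b, a)/(a+b).
p, a, b = 0.01, 1.0, 2.0
P = [[1 - p * a, p * a], [p * b, 1 - p * b]]
pi = [b / (a + b), a / (a + b)]
Q = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

# Joint chain on pairs (i, j): transition P[i][i2] * Q[i2][j2],
# invariant distribution pi[i] * Q[i][j].
states = [(i, j) for i in range(2) for j in range(3)]
P_bar = [[P[i][i2] * Q[i2][j2] for (i2, j2) in states] for (i, j) in states]
pi_bar = [pi[i] * Q[i][j] for (i, j) in states]

h_joint = entropy_rate(P_bar, pi_bar)
h_sum = entropy_rate(P, pi) + sum(
    pi[i] * -sum(xlogx(q) for q in Q[i]) for i in range(2))
print(h_joint, h_sum)
```

The two printed values agree up to rounding, which is the content of the "standard formula" step in the proof.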

We now establish the lower bound. We are aiming to show $h(X \mid Y) = O(p)$ (for
which it suffices to show $H(X_0^{L-1} \mid Y) = O(Lp)$ for some $L$). Setting $L = |\log p|^4$ and
letting $\mathcal{P}$ be a suitable partition, we estimate $H(X_0^{L-1} \mid Y \vee \mathcal{P})$ and use the inequality

(6) $H(X_0^{L-1} \mid Y) \le H(X_0^{L-1} \mid Y \vee \mathcal{P}) + H(\mathcal{P}).$

We define the partition $\mathcal{P}$ as follows: Set $K = |\log p|^2$ and let $\mathcal{P} = \{E_m, E_b, E_{g1}, E_{g2}\}$.
Here $E_m$ (for many) is the event that there are at least two transitions in $X_0^{L-1}$,
$E_b$ (for boundary) is the event that there is exactly one transition and that it takes
place within a distance $K$ of the boundary of the block and finally $E_g$ (for good) is
the event that there is at most one transition and if it takes place, then it occurs at
a distance at least $K$ from the boundary of the block. This will later be subdivided
into $E_{g1}$ and $E_{g2}$.

If $E_m$ holds then we bound the entropy contribution by the entropy of the
equidistributed case whereas if $E_b$ holds, there are $2K|S|(|S| - 1) = O(K)$ possible values
of $X_0^{L-1}$. This yields the following estimates:

(7) $\mathbb{P}(E_m) = O(p^2 L^2) = o(p)$

(8) $H(X_0^{L-1} \mid E_m) \le L \log |S|$

(9) $\mathbb{P}(E_b) = O(pK)$

(10) $H(X_0^{L-1} \mid E_b) = O(\log K).$



It follows that $\mathbb{P}(E_g) = 1 - O(pK)$. Given that the event $E_g$ holds, the sequence
$X_0^{L-1}$ belongs to $B = \{a^L : a \in S\} \cup \{a^i b^{L-i} : a, b \in S, K \le i \le L - K\}$.

Given a sequence $u \in B$, the log-likelihood of $u$ being the input sequence yielding
the output $Y_0^{L-1}$ is $L_u(Y_0^{L-1}) = \sum_{i=0}^{L-1} \log Q_{u_i Y_i}$. We define $Z_0^{L-1}$ to be the
sequence in $B$ for which $L_{Z_0^{L-1}}(Y_0^{L-1})$ is maximized (breaking ties lexicographically if
necessary). We will then show using large deviation methods that when $E_g$ holds,
$Z_0^{L-1}$ is a good reconstruction of $X_0^{L-1}$ with small error.
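The maximum likelihood reconstruction over $B$ can be sketched directly by enumerating the candidate sequences. This is a minimal illustration rather than an efficient implementation: the function names, the channel `Q` and the observed word `y` are hypothetical, and ties are resolved in favour of the lexicographically smallest candidate, which is one possible reading of the tie-breaking rule.

```python
import math

def candidates(num_states, L, K):
    """Yield every sequence in B: constants a^L and single-switch
    sequences a^i b^(L-i) with a != b and K <= i <= L - K."""
    for a in range(num_states):
        yield (a,) * L
    for a in range(num_states):
        for b in range(num_states):
            if a != b:
                for i in range(K, L - K + 1):
                    yield (a,) * i + (b,) * (L - i)

def log_likelihood(u, y, Q):
    """L_u(y) = sum_t log Q[u_t][y_t], with log 0 treated as -infinity."""
    total = 0.0
    for state, symbol in zip(u, y):
        q = Q[state][symbol]
        total += math.log(q) if q > 0 else float("-inf")
    return total

def ml_reconstruct(y, Q, K):
    """Return a maximizer of L_u(y) over B. Sorting first means max()
    keeps the lexicographically smallest sequence among ties."""
    pool = sorted(candidates(len(Q), len(y), K))
    return max(pool, key=lambda u: log_likelihood(u, y, Q))

# Hypothetical binary symmetric channel and a noisy view of 0^6 1^6.
Q = [[0.9, 0.1], [0.1, 0.9]]
y = (0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1)
print(ml_reconstruct(y, Q, K=2))
```

Each candidate costs $O(L)$ to score and $B$ has fewer than $|S|^2 L$ elements, so the enumeration is $O(|S|^2 L^2)$; prefix sums of the log-likelihoods would reduce this, but that optimisation is omitted here.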

We calculate for $u, v \in B$,

$\mathbb{P}\left( L_v(Y_0^{L-1}) \ge L_u(Y_0^{L-1}) \mid X_0^{L-1} = u \right) = \mathbb{P}\left( \sum_{i \in \Delta} \log(Q_{v_i Y_i}/Q_{u_i Y_i}) \ge 0 \mid X_0^{L-1} = u \right),$

where $\Delta = \{i : u_i \ne v_i\}$. For each $i \in \Delta$, given that $X_0^{L-1} = u$, we have that
$\log(Q_{v_i Y_i}/Q_{u_i Y_i})$ is an independent random variable taking the value $\log(Q_{v_i j}/Q_{u_i j})$
with probability $Q_{u_i j}$.

It is well known (and easy to verify using elementary calculus) that for a given
probability distribution $\pi$ on a set $T$, the probability distribution $\sigma$ maximizing
$\sum_{j \in T} \pi_j \log(\sigma_j/\pi_j)$ is $\sigma = \pi$ (for which the maximum is 0). Accordingly we see that
given that $X_0^{L-1} = u$, $L_v(Y_0^{L-1}) - L_u(Y_0^{L-1})$ is the sum of $|\Delta|$ random variables,
each having one of $|S|(|S| - 1)$ distributions, each with negative expectation (the
expectation is strictly negative because the rows of $Q$ are distinct). It
follows from Hoeffding's Inequality [2] that there exist $C > 0$ and $\eta < 1$ independent
of $p$ such that $\mathbb{P}(L_v(Y_0^{L-1}) \ge L_u(Y_0^{L-1}) \mid X_0^{L-1} = u) \le C\eta^{|\Delta|}$.
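Both claims in this step can be illustrated numerically. The sketch below is an illustration under assumptions and swaps in a different tool from the one cited: instead of Hoeffding's inequality it uses the Chernoff bound with exponent $1/2$, which gives $\mathbb{P}(\sum_{i} \log(Q_{v_i Y_i}/Q_{u_i Y_i}) \ge 0) \le \prod_i \sum_j \sqrt{Q_{u_i j} Q_{v_i j}}$, so one may take $C = 1$ and $\eta$ equal to the largest Bhattacharyya coefficient between distinct rows. The channel `Q` is a hypothetical example with strictly positive entries (required by the code as written).

```python
import math
from itertools import product

def expected_log_ratio(Q, u, v):
    """E[log(Q[v][Y] / Q[u][Y])] for Y ~ Q[u]; this is -KL(Q[u] || Q[v])."""
    return sum(q * math.log(Q[v][j] / q) for j, q in enumerate(Q[u]) if q > 0)

def chernoff_eta(Q):
    """Largest value of sum_j sqrt(Q[u][j] * Q[v][j]) over rows u != v."""
    n = len(Q)
    return max(sum(math.sqrt(a * b) for a, b in zip(Q[u], Q[v]))
               for u in range(n) for v in range(n) if u != v)

def exact_tail(Q, u, v, m):
    """P(sum of m i.i.d. log-ratios >= 0) when each Y_i ~ Q[u],
    computed by enumerating all output words of length m."""
    total = 0.0
    for y in product(range(len(Q[u])), repeat=m):
        score = sum(math.log(Q[v][j] / Q[u][j]) for j in y)
        if score >= 0:
            total += math.prod(Q[u][j] for j in y)
    return total

# Hypothetical channel with distinct, strictly positive rows.
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6],
     [0.3, 0.4, 0.3]]

eta = chernoff_eta(Q)
print("eta =", eta)
for m in (1, 2, 4, 6):
    print(m, exact_tail(Q, 0, 1, m), eta ** m)
```

The exact tail probabilities sit below $\eta^m$ for every $m$, which is the exponential decay in $|\Delta|$ that the proof needs.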

We deduce that for $u, v \in B$

(11) $\mathbb{P}(Z_0^{L-1} = v \mid X_0^{L-1} = u) \le C\eta^{\delta(u,v)},$

where $\delta(u, v)$ is the number of places in which $u$ and $v$ differ.
We split $E_g$ into two subsets:

$E_{g1} = E_g \cap \{\delta(X_0^{L-1}, Z_0^{L-1}) < K\}$; and

$E_{g2} = E_g \cap \{\delta(X_0^{L-1}, Z_0^{L-1}) \ge K\}.$

Since there are fewer than $|S|^2 L$ elements in $B$, we see using (11) and recalling that
$K = |\log p|^2$ that

(12) $\mathbb{P}(E_{g2}) \le |S|^2 L C \eta^K = o(p)$

(13) $H(X_0^{L-1} \mid E_{g2}) \le \log(|S|^2 L).$

Combining (7) with (9) and (12) we see that $\mathbb{P}(E_{g1}) = 1 - O(pK)$. We then
obtain

(14) $H(\mathcal{P}) = O(pK \log(pK)) = o(pL).$

Conditioned on being in $E_{g1}$, if $Z_0 = Z_{L-1}$ then $X_0^{L-1} = Z_0^{L-1}$ so we have

(15) $H(X_0^{L-1} \mid Z_0^{L-1} \vee \mathcal{P} \mid E_{g1} \cap \{Z_0 = Z_{L-1}\}) = 0.$

Given that $E_{g1}$ holds, if $X_0^{L-1} = a^i b^{L-i}$ then $Z_0^{L-1}$ must be of the form $a^j b^{L-j}$
for some $j$ satisfying $-K < j - i < K$. Denote this difference $j - i$ by the random
variable $N$. We have

$H(X_0^{L-1} \mid Y_0^{L-1} \vee \mathcal{P} \mid E_{g1} \cap \{Z_0 \ne Z_{L-1}\})$
$\le H(X_0^{L-1} \mid Z_0^{L-1} \vee \mathcal{P} \mid E_{g1} \cap \{Z_0 \ne Z_{L-1}\})$
$= H(N \mid Z_0^{L-1} \vee \mathcal{P} \mid E_{g1} \cap \{Z_0 \ne Z_{L-1}\})$
$\le H(N \mid E_{g1} \cap \{Z_0 \ne Z_{L-1}\}),$

where the first inequality follows because $Z_0^{L-1}$ is determined by $Y_0^{L-1}$ so the
partition generated by $Y_0^{L-1}$ is finer than that generated by $Z_0^{L-1}$; and the equality
follows because given $Z_0^{L-1}$ and conditioned on being in $E_{g1}$, knowing $N$ is sufficient
to reconstruct $X_0^{L-1}$ so the partition generated by $N$ is the same as the partition
generated by $X_0^{L-1}$.

Since $E_{g1} \cap \{Z_0 \ne Z_{L-1}\} = E_{g1} \cap \{X_0 \ne X_{L-1}\}$, we have for $|k| < K$,
$\mathbb{P}(N = k \mid E_{g1} \cap \{Z_0 \ne Z_{L-1}\}) = \mathbb{P}(N = k \mid E_{g1} \cap \{X_0 \ne X_{L-1}\})$. From (11) this is bounded
above by $C\eta^{|k|}$. Since a distribution with these bounds has entropy bounded above
independently of $p$, it follows from this that $H(N \mid E_{g1} \cap \{Z_0 \ne Z_{L-1}\}) = O(1)$ and
hence that

(16) $H(X_0^{L-1} \mid Y_0^{L-1} \vee \mathcal{P} \mid E_{g1} \cap \{Z_0 \ne Z_{L-1}\}) = O(1).$

Finally we have $\mathbb{P}(E_{g1} \cap \{Z_0 \ne Z_{L-1}\}) = O(pL)$.

We now have $H(X_0^{L-1} \mid Y_0^{L-1}) \le H(X_0^{L-1} \mid Y_0^{L-1} \vee \mathcal{P}) + H(\mathcal{P})$. We estimate the
right side using (4), splitting the space up into the sets $E_b$, $E_m$, $E_{g2}$, $E_{g1} \cap \{Z_0 = Z_{L-1}\}$
and $E_{g1} \cap \{Z_0 \ne Z_{L-1}\}$. All of these sets are $Y_0^{L-1} \vee \mathcal{P}$ measurable.
Calculating the contribution to the entropy from each of the sets, each part contributes at
most $O(pL)$, yielding the estimate $H(X_0^{L-1} \mid Y_0^{L-1}) = O(pL)$, so that $h(X \mid Y) = O(p)$
as required. This completes the first part of the proof.

For the second part of the proof, suppose that the additional properties are satisfied
(the existence of $i$, $i'$ and $j$ such that $P_{ii'} > 0$, $Q_{ij} > 0$ and $Q_{i'j} > 0$). We need
to show that $h(X \mid Y) \ge cp$ for some $c > 0$ or equivalently that $H(X_0 \mid Y, X_{-\infty}^{-1}) \ge cp$.
In fact, we show the stronger statement: $H(X_0 \mid Y, (X_n)_{n \ne 0}) \ge cp$. Let $A$ be the
event that $X_{-1} = i$ and $X_1 = i'$ and $Y_0 = j$. We now estimate $H(X_0 \mid Y, (X_n)_{n \ne 0} \mid A)$.

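For $x \in A$, we have

$\mathbb{P}(X_0 = i \mid Y, (X_n)_{n \ne 0})(x) = \dfrac{P_{ii} P_{ii'} Q_{ij}}{P_{ii} P_{ii'} Q_{ij} + P_{ii'} P_{i'i'} Q_{i'j} + \sum_{k \notin \{i, i'\}} P_{ik} P_{ki'} Q_{kj}}$

$\mathbb{P}(X_0 = i' \mid Y, (X_n)_{n \ne 0})(x) = \dfrac{P_{ii'} P_{i'i'} Q_{i'j}}{P_{ii} P_{ii'} Q_{ij} + P_{ii'} P_{i'i'} Q_{i'j} + \sum_{k \notin \{i, i'\}} P_{ik} P_{ki'} Q_{kj}}$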



As $p \to 0$, we have $\mathbb{P}(X_0 = i \mid Y, (X_n)_{n \ne 0})(x) \to Q_{ij}/(Q_{ij} + Q_{i'j})$ and
$\mathbb{P}(X_0 = i' \mid Y, (X_n)_{n \ne 0})(x) \to Q_{i'j}/(Q_{ij} + Q_{i'j})$. From this we see that $H(X_0 \mid Y, (X_n)_{n \ne 0} \mid A)$
converges to a non-zero constant as $p \to 0$. Since $A$ has probability $\Theta(p)$, applying
(4) we obtain the lower bound $h(X \mid Y) \ge cp$. From this we deduce the claimed
upper bound for $h(Y)$:

$h(Y) \le h(X) + \sum_i \pi_i H_{\mathrm{chan}}(i) - cp.$

In this case we therefore have $h(Y) = h(X) + \sum_i \pi_i H_{\mathrm{chan}}(i) - \Theta(p)$. This
completes the proof of the theorem.

□ 
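The limiting posterior used in the second part of the proof can be observed numerically. The sketch below is only an illustration: the matrices `A` and `Q` and the choice $i = 0$, $i' = 1$, $j = 1$ are hypothetical, and the function name is ours. It evaluates the displayed conditional probabilities for decreasing $p$ and compares them with $Q_{ij}/(Q_{ij} + Q_{i'j})$.

```python
import math

def middle_state_posterior(P, Q, i, i2, j):
    """P(X_0 = k | X_{-1} = i, X_1 = i2, Y_0 = j) for every state k,
    proportional to P[i][k] * P[k][i2] * Q[k][j]."""
    weights = [P[i][k] * P[k][i2] * Q[k][j] for k in range(len(P))]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(dist):
    """Shannon entropy in nats, ignoring zero-probability atoms."""
    return -sum(x * math.log(x) for x in dist if x > 0)

# Hypothetical example with A[0][1] > 0, Q[0][1] > 0 and Q[1][1] > 0.
A = [[-1.0, 0.5, 0.5],
     [1.0, -2.0, 1.0],
     [0.5, 0.5, -1.0]]
Q = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6],
     [0.3, 0.4, 0.3]]
i, i2, j = 0, 1, 1
limit = Q[i][j] / (Q[i][j] + Q[i2][j])

for p in (1e-2, 1e-4, 1e-6):
    P = [[(1.0 if r == c else 0.0) + p * A[r][c] for c in range(3)]
         for r in range(3)]
    post = middle_state_posterior(P, Q, i, i2, j)
    print(p, post[i], limit, entropy(post))
```

As $p$ decreases the posterior mass on state $i$ approaches the limit and the entropy of the posterior stays bounded away from zero, which is what yields a contribution of order $p$ once multiplied by $\mathbb{P}(A)$.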

References 

1. P. Chigansky, The entropy rate of a binary channel with slowly switching input, Available on
arXiv: cs/0602074v1, 2006.

2. W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist.
Assoc. 58 (1963), 13-30.

3. R. Z. Khasminskii and O. Zeitouni, Asymptotic filtering for finite state Markov chains,
Stochastic Process. Appl. 63 (1996), 1-10.

4. C. Nair, E. Ordentlich, and T. Weissman, Asymptotic filtering and entropy rate of a hidden
Markov process in the rare transitions regime, International Symposium on Information Theory,
2005, pp. 1838-1842.

5. L. Shue, B. Anderson, and F. DeBruyne, Asymptotic smoothing errors for hidden Markov
models, IEEE Trans. Signal Processing 48 (2000), 3289-3302.

Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
E-mail address: peres@microsoft.com

Department of Mathematics and Statistics, University of Victoria, Victoria, BC
V8W 3R4, CANADA
E-mail address: aquas@uvic.ca



